Local files, BeautifulSoup recap

Imports


In [1]:
import os
from bs4 import BeautifulSoup
import pandas as pd

Let's list everything in the current folder


In [2]:
os.listdir(os.curdir)


Out[2]:
['20171130_SIX_ManagementAktien.csv',
 '03 First steps with Pandas.ipynb',
 '.DS_Store',
 '05 Homework_SNF_ Data.ipynb',
 '04 First steps with stackoverflow.md',
 '01_html_Einblick.htm',
 '01_html_Einblick.md',
 '.ipynb_checkpoints',
 '02 Lokales einlesen, BeautifulSoup recap.ipynb']
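
The listing mixes notebooks, CSV files and hidden entries. As a minimal sketch, the names could be narrowed down to HTML files by extension. The temporary folder and the helper name `html_files` are assumptions made for illustration and are not part of the original notebook.

```python
import os
import tempfile


def html_files(folder):
    """Return the sorted names in `folder` that end in .htm or .html."""
    return sorted(
        name for name in os.listdir(folder)
        if name.lower().endswith(('.htm', '.html'))
    )


# Build a throwaway folder with a few empty files to try the helper on.
with tempfile.TemporaryDirectory() as tmp:
    for name in ['01_html_Einblick.htm', '01_html_Einblick.md', '.DS_Store']:
        open(os.path.join(tmp, name), 'w').close()
    found = html_files(tmp)

print(found)
```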

Let's read in the file we want


In [3]:
# Open the file in a context manager so the handle is closed automatically
with open('01_html_Einblick.htm', 'r') as f:
    file = f.read()

In [4]:
file


Out[4]:
'<!DOCTYPE html>\n<hmtl>\n<body>\n<h1> Ich bin ein Titel</h1>\n<p> Hier schreibe ich jetzt einen kleinen Text</p>\n\n  <table style="width:10%">\n<tr>\n<th>Firstname</th>\n<th>Lastname</th>\n<th>Age</th>\n</tr>\n\n<td>Markus</td>\n<td>Schmitt</td>\n<td>50</td>\n</tr>\n\n<tr>\n<td>Susanne</td>\n<td>Peters</td>\n<td>94</td>\n\n<td>Markus</td>\n<td>Schmitt</td>\n<td>50</td>\n</tr>\n\n<tr>\n<td>Michael</td>\n<td>Peters</td>\n<td>22</td>\n\n<td>Ahmad</td>\n<td>Islam</td>\n<td>30</td>\n</tr>\n\n<tr>\n<td>Michaela</td>\n<td>Holmstadt</td>\n<td>35</td>\n\n</tr>\n</table>\n<p>\n<a href=\'https://www.w3schools.com/html/html_elements.asp\'> Link auf mehr HMTL Code</a>\n</body>\n</hmtl>\n'

Let's bring in BeautifulSoup


In [5]:
BeautifulSoup(file, 'html.parser')


Out[5]:
<!DOCTYPE html>

<hmtl>
<body>
<h1> Ich bin ein Titel</h1>
<p> Hier schreibe ich jetzt einen kleinen Text</p>
<table style="width:10%">
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
<td>Markus</td>
<td>Schmitt</td>
<td>50</td>
</table></body></hmtl>
<tr>
<td>Susanne</td>
<td>Peters</td>
<td>94</td>
<td>Markus</td>
<td>Schmitt</td>
<td>50</td>
</tr>
<tr>
<td>Michael</td>
<td>Peters</td>
<td>22</td>
<td>Ahmad</td>
<td>Islam</td>
<td>30</td>
</tr>
<tr>
<td>Michaela</td>
<td>Holmstadt</td>
<td>35</td>
</tr>

<p>
<a href="https://www.w3schools.com/html/html_elements.asp"> Link auf mehr HMTL Code</a>
</p>

In [6]:
file_soup = BeautifulSoup(file, 'html.parser')

In [8]:
lst = file_soup.find_all('td')

In [15]:
pd.DataFrame([{'Vorname':'Markus', 'Nachname': 'Peters', 'Alter': 89}, {'Vorname':'Susanne', 'Nachname': 'Peters', 'Alter': 94}])


Out[15]:
   Alter  Nachname  Vorname
0     89    Peters   Markus
1     94    Peters  Susanne

In [ ]:
[{'Vorname':'Markus', 'Nachname': 'Peters', 'Alter': 89}, 
 {'Vorname':'Susanne', 'Nachname': 'Peters', 'Alter': 94}]

In [18]:
lst[0].text


Out[18]:
'Markus'
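
`find_all` returns a list of tag objects, and `.text` gives the string inside each tag. A small sketch on a hypothetical one-row table; the snippet is made up for illustration and is not the file read above:

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed snippet with a single table row.
snippet = '<table><tr><td>Markus</td><td>Schmitt</td><td>50</td></tr></table>'
soup = BeautifulSoup(snippet, 'html.parser')

cells = soup.find_all('td')            # list of Tag objects
texts = [cell.text for cell in cells]  # plain strings

print(cells[0])  # the tag, including its markup
print(texts)
```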

In [60]:
# Collect the text content of every <td> cell
new_lst = []
for x in lst:
    new_lst.append(x.text)

In [61]:
new_lst


Out[61]:
['Markus',
 'Schmitt',
 '50',
 'Susanne',
 'Peters',
 '94',
 'Markus',
 'Schmitt',
 '50',
 'Michael',
 'Peters',
 '22',
 'Ahmad',
 'Islam',
 '30',
 'Michaela',
 'Holmstadt',
 '35']

In [62]:
new_lst[0::3]


Out[62]:
['Markus', 'Susanne', 'Markus', 'Michael', 'Ahmad', 'Michaela']
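
The slice `[start::step]` picks every `step`-th element beginning at index `start`. Because each person occupies three consecutive cells, a step of 3 separates the columns. A minimal sketch on a shortened, illustrative list:

```python
# Flat list of cell texts, three values per person (first name, last name, age).
cells = ['Markus', 'Schmitt', '50', 'Susanne', 'Peters', '94']

first_names = cells[0::3]  # indices 0, 3, ...
last_names = cells[1::3]   # indices 1, 4, ...
ages = cells[2::3]         # indices 2, 5, ...

print(first_names, last_names, ages)
```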

In [63]:
nachnamen_liste = new_lst[1::3]

In [64]:
alters_liste = new_lst[2::3]

In [65]:
vornamen_liste = new_lst[0::3]

In [66]:
vornamen_liste


Out[66]:
['Markus', 'Susanne', 'Markus', 'Michael', 'Ahmad', 'Michaela']

In [80]:
# Build one dict per person by walking the three column lists in parallel
vn_list = []
for vorname, alter, nachname in zip(vornamen_liste, alters_liste, nachnamen_liste):
    mini_dict = {'Vorname':vorname,
                 'Nachname': nachname,
                 'Alter': alter}
    vn_list.append(mini_dict)

In [81]:
df = pd.DataFrame(vn_list)

In [82]:
df


Out[82]:
   Alter   Nachname   Vorname
0     50    Schmitt    Markus
1     94     Peters   Susanne
2     50    Schmitt    Markus
3     22     Peters   Michael
4     30      Islam     Ahmad
5     35  Holmstadt  Michaela

In [76]:
df['Alter'] = alters_liste

In [78]:
df['Nachnamen'] = nachnamen_liste

In [79]:
df


Out[79]:
    Vorname  Alter  Nachnamen
0    Markus     50    Schmitt
1   Susanne     94     Peters
2    Markus     50    Schmitt
3   Michael     22     Peters
4     Ahmad     30      Islam
5  Michaela     35  Holmstadt

In [50]:
df['Alter'] = alters_lst


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-50-700bed527e62> in <module>()
----> 1 df['Alter'] = alters_lst

~/.virtualenvs/master/lib/python3.5/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
   2329         else:
   2330             # set column
-> 2331             self._set_item(key, value)
   2332 
   2333     def _setitem_slice(self, key, value):

~/.virtualenvs/master/lib/python3.5/site-packages/pandas/core/frame.py in _set_item(self, key, value)
   2395 
   2396         self._ensure_valid_index(value)
-> 2397         value = self._sanitize_column(key, value)
   2398         NDFrame._set_item(self, key, value)
   2399 

~/.virtualenvs/master/lib/python3.5/site-packages/pandas/core/frame.py in _sanitize_column(self, key, value, broadcast)
   2566 
   2567             # turn me into an ndarray
-> 2568             value = _sanitize_index(value, self.index, copy=False)
   2569             if not isinstance(value, (np.ndarray, Index)):
   2570                 if isinstance(value, list) and len(value) > 0:

~/.virtualenvs/master/lib/python3.5/site-packages/pandas/core/series.py in _sanitize_index(data, index, copy)
   2877 
   2878     if len(data) != len(index):
-> 2879         raise ValueError('Length of values does not match length of ' 'index')
   2880 
   2881     if isinstance(data, PeriodIndex):

ValueError: Length of values does not match length of index
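
The error means the assigned list has a different number of entries than the DataFrame has rows. A minimal sketch that reproduces it with made-up data and then checks the lengths first; the names `df_demo`, `ages_short` and `ages_full` are hypothetical:

```python
import pandas as pd

df_demo = pd.DataFrame({'Vorname': ['Markus', 'Susanne', 'Michael']})

ages_short = ['50', '94']       # one value too few for three rows
ages_full = ['50', '94', '22']  # one value per row

try:
    df_demo['Alter'] = ages_short
    error_raised = False
except ValueError as err:
    # pandas refuses a column whose length differs from the index length
    error_raised = True
    print(err)

# Checking the lengths first avoids the exception.
if len(ages_full) == len(df_demo):
    df_demo['Alter'] = ages_full

print(df_demo)
```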

In [45]:
df


Out[45]:
    Vorname                  Alter
0    Markus    {'Alter': 'Markus'}
1   Susanne   {'Alter': 'Susanne'}
2    Markus    {'Alter': 'Markus'}
3   Michael   {'Alter': 'Michael'}
4     Ahmad     {'Alter': 'Ahmad'}
5  Michaela  {'Alter': 'Michaela'}

In [18]:
lst = file_soup.find_all('td')

In [20]:
for elem in lst:
    print(elem.text)


Markus
Schmitt
50
Susanne
Peters
94
Markus
Schmitt
50
Michael
Peters
22
Ahmad
Islam
30
Michaela
Holmstadt
35

How do we get this into a suitable shape now?

Just one possible approach


In [22]:
lst[::3]


Out[22]:
[<td>Markus</td>,
 <td>Susanne</td>,
 <td>Markus</td>,
 <td>Michael</td>,
 <td>Ahmad</td>,
 <td>Michaela</td>]

In [23]:
lst[1::3]


Out[23]:
[<td>Schmitt</td>,
 <td>Peters</td>,
 <td>Schmitt</td>,
 <td>Peters</td>,
 <td>Islam</td>,
 <td>Holmstadt</td>]

In [24]:
lst[2::3]


Out[24]:
[<td>50</td>, <td>94</td>, <td>50</td>, <td>22</td>, <td>30</td>, <td>35</td>]

In [27]:
# Walk first names, last names and ages in parallel; one dict per table row
final_lst = []
for n,nn,a in zip(lst[::3], lst[1::3], lst[2::3]):
    mini_dict = {'Vorname':n.text,
                 'Nachname':nn.text,
                 'Alter':a.text}
    final_lst.append(mini_dict)

In [28]:
pd.DataFrame(final_lst)


Out[28]:
   Alter   Nachname   Vorname
0     50    Schmitt    Markus
1     94     Peters   Susanne
2     50    Schmitt    Markus
3     22     Peters   Michael
4     30      Islam     Ahmad
5     35  Holmstadt  Michaela
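
As another possible approach (a sketch, not the notebook's method), the flat list of cell texts can be cut into rows of three and passed to `pd.DataFrame` with explicit column names. The sample values are copied from the table cells shown earlier; the names `cells`, `rows` and `df_rows` are hypothetical, and converting `Alter` to integers is an added assumption about how the column will be used.

```python
import pandas as pd

# A few cell texts in document order: first name, last name, age, repeated.
cells = ['Markus', 'Schmitt', '50',
         'Susanne', 'Peters', '94',
         'Michael', 'Peters', '22']

# Cut the flat list into consecutive chunks of three, one chunk per person.
rows = [cells[i:i + 3] for i in range(0, len(cells), 3)]

df_rows = pd.DataFrame(rows, columns=['Vorname', 'Nachname', 'Alter'])
df_rows['Alter'] = df_rows['Alter'].astype(int)  # ages arrive as strings

print(df_rows)
```

Passing `columns` explicitly fixes the column order, so it does not depend on how a particular pandas version orders dict keys.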
